library(tidyverse) # for graphing and data cleaning
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.0 v dplyr 1.0.3
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## Warning: package 'tibble' was built under R version 4.0.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidymodels) # for modeling
## Warning: package 'tidymodels' was built under R version 4.0.4
## -- Attaching packages -------------------------------------- tidymodels 0.1.2 --
## v broom 0.7.3 v recipes 0.1.15
## v dials 0.0.9 v rsample 0.0.9
## v infer 0.5.4 v tune 0.1.3
## v modeldata 0.1.0 v workflows 0.2.2
## v parsnip 0.1.5 v yardstick 0.0.7
## Warning: package 'dials' was built under R version 4.0.4
## Warning: package 'infer' was built under R version 4.0.4
## Warning: package 'modeldata' was built under R version 4.0.4
## Warning: package 'parsnip' was built under R version 4.0.4
## Warning: package 'recipes' was built under R version 4.0.4
## Warning: package 'rsample' was built under R version 4.0.4
## Warning: package 'tune' was built under R version 4.0.4
## Warning: package 'workflows' was built under R version 4.0.4
## Warning: package 'yardstick' was built under R version 4.0.4
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
library(naniar) # for analyzing missing values
## Warning: package 'naniar' was built under R version 4.0.4
library(vip) # for variable importance plots
## Warning: package 'vip' was built under R version 4.0.4
##
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
##
## vi
theme_set(theme_minimal()) # Lisa's favorite theme
hotels <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## hotel = col_character(),
## arrival_date_month = col_character(),
## meal = col_character(),
## country = col_character(),
## market_segment = col_character(),
## distribution_channel = col_character(),
## reserved_room_type = col_character(),
## assigned_room_type = col_character(),
## deposit_type = col_character(),
## agent = col_character(),
## company = col_character(),
## customer_type = col_character(),
## reservation_status = col_character(),
## reservation_status_date = col_date(format = "")
## )
## i Use `spec()` for the full column specifications.
When you finish the assignment, remove the # from the options chunk at the top, so that messages and warnings aren’t printed. If you are getting errors in your code, add error = TRUE so that the file knits. I would recommend not removing the # until you are completely finished.
Read the Quick Intro section of the Using git and GitHub in R Studio set of Course Materials. Set up Git and GitHub and create a GitHub repo and associated R Project (done for you when you clone the repo) for this homework assignment. Put this file into the project. You should always open the R Project (.Rproj) file when you work with any of the files in the project.
Task: Below, post a link to your GitHub repository.
You’ll be using RStudio to create a personal website to showcase your work from this class! Start by watching the Sharing on Short Notice webinar by Alison Hill and Desirée De Leon of RStudio. This should help you choose the type of website you’d like to create.
Once you’ve chosen that, you might want to look through some of the other Building a website resources I posted on the resources page of our course website. I highly recommend making a nice landing page where you give a brief introduction of yourself.
Tasks:
Include a link to your website below. (If anyone does not want to post a website publicly, please talk to me and we will find a different solution).
Listen to at least the first 20 minutes of “Building a Career in Data Science, Chapter 4: Building a Portfolio”. Go to the main podcast website and navigate to a podcast provider that works for you to find that specific episode. Write 2-3 sentences reflecting on what they discussed and why creating a website might be helpful for you.
(Optional) Create an R package with your own customized ggplot2 theme! Write a post on your website about why you made the choices you did for the theme. See the Building an R package and Custom ggplot2 themes resources.
tidymodels
Read through and follow along with the Machine Learning review with an intro to the tidymodels package posted on the Course Materials page.
Tasks:
Read about the data, hotels, on the Tidy Tuesday page it came from. There is also a link to an article from the original authors. The outcome we will be predicting is called is_canceled.
Some variables that might be predictive are:
previous_cancellations - A high number of previous cancellations could mean a higher chance of cancellation.
previous_bookings_not_canceled - A high number of previous bookings that were not canceled would mean a lower chance of cancellation.
deposit_type - Non Refund deposits would probably be less likely to be canceled than No Deposit or Refundable deposits.
booking_changes - A high number of booking changes could mean the stay is more prone to changes and so to cancellation.
lead_time - Stays booked further in advance could be more likely to be canceled than stays booked closer to the day of the stay.
Some issues with the data are possible privacy problems of hotel guests.
We would be able to predict which bookings are more likely to get canceled.
# deposit_type vs is_canceled
hotels %>%
count(deposit_type)
hotels %>%
group_by(deposit_type) %>%
summarise( tot_cancel = sum(is_canceled), tot_books = n() ) %>%
mutate(perc_cancel = tot_cancel/tot_books)
That is surprising: the cancellation rates by deposit type are not what I expected.
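The cancellation rate per group can also be computed as the mean of the 0/1 outcome, which is a handy cross-check on the sum-divided-by-count version. This is a minimal sketch on made-up data; `toy` and `toy_rates` are hypothetical names and the numbers are not taken from hotels.

```r
library(dplyr)

# Toy data, invented for illustration only
toy <- tibble(
  deposit_type = c("No Deposit", "No Deposit", "Non Refund", "Non Refund"),
  is_canceled  = c(0, 1, 1, 1)
)

toy_rates <- toy %>%
  group_by(deposit_type) %>%
  summarise(
    tot_cancel  = sum(is_canceled),   # number of canceled bookings
    tot_books   = n(),                # number of bookings
    perc_cancel = mean(is_canceled)   # mean of a 0/1 variable = proportion of 1s
  )

toy_rates
```

Both routes give the same number, so if `tot_cancel / tot_books` and `mean(is_canceled)` ever disagree, the outcome is probably not coded as numeric 0/1.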
#previous_cancellations vs is_canceled
hotels %>%
mutate(total_previous = previous_cancellations+previous_bookings_not_canceled) %>%
filter(total_previous != 0) %>%
mutate(prev_cancel_perc = previous_cancellations/total_previous) %>%
ggplot(aes(x = prev_cancel_perc, fill = (is_canceled==1))) +
geom_density(alpha = 0.2) # alpha is a constant, so set it outside aes()
#lead time vs is_canceled
hotels %>%
ggplot(aes(x = lead_time, fill =(is_canceled==1)))+
geom_density(alpha = 0.2) # alpha is a constant, so set it outside aes()
is_canceled. Since we have a lot of data, we’re going to split the data 50/50 between training and test. I have already set.seed() for you. Be sure to use hotels_mod in the splitting.
hotels_mod <- hotels %>%
mutate(is_canceled = as.factor(is_canceled)) %>%
mutate(across(where(is.character), as.factor)) %>%
select(-arrival_date_year,
-reservation_status,
-reservation_status_date) %>%
add_n_miss() %>%
filter(n_miss_all == 0) %>%
select(-n_miss_all)
hotels_mod
set.seed(494)
hotels_split <- initial_split(hotels_mod,
                              prop = .5) # 50/50 split, as the instructions ask
hotels_training<-training(hotels_split)
hotels_testing<- testing(hotels_split)
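A quick sanity check on a split is to confirm that the two pieces add up to the full data and share no rows. This is a minimal sketch on made-up data, assuming the rsample package; `toy`, `toy_split`, `toy_train`, and `toy_test` are hypothetical names.

```r
library(rsample)

set.seed(494)

# Toy data with an id column so rows can be traced after splitting
toy <- data.frame(
  id = 1:100,
  y  = factor(rep(c(0, 1), times = c(60, 40)))
)

toy_split <- initial_split(toy, prop = 0.5)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Share of rows that ended up in the training set
nrow(toy_train) / nrow(toy)

# Number of ids that appear in both sets (should be zero)
length(intersect(toy_train$id, toy_test$id))
```

The same two checks work on hotels_training and hotels_testing by swapping in any column, or the row count, that identifies rows.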
Set up the recipe with is_canceled as the outcome and all other variables as predictors (HINT: ~.).
Use a step_XXX() function or functions (I think there are other ways to do this, but I found step_mutate_at() easiest) to create some indicator variables for the following variables: children, babies, and previous_cancellations. So, the new variable should be a 1 if the original is more than 0 and 0 otherwise. Make sure you do this in a way that accounts for values that may be larger than any we see in the dataset.
For the agent and company variables, make new indicator variables that are 1 if they have a value of NULL and 0 otherwise.
Use fct_lump_n() to lump together countries that aren’t in the top 5 most occurring.
Use step_normalize() to center and scale all the non-categorical predictor variables. (Do this BEFORE creating dummy variables. When I tried to do it after, I ran into an error - I’m still investigating why.)
Create dummy variables for all categorical predictors (make sure you have -all_outcomes() in this part!!).
Use the prep() and juice() functions to apply the steps to the training data just to check that everything went as planned.
#alternative
hotels_recipe <- recipe(is_canceled ~ .,
data = hotels_training)
hotels_mod
hotels_recipe <- hotels_recipe %>%
  # indicator: 1 if the original value is more than 0, 0 otherwise
  step_mutate_at(children, babies, previous_cancellations,
                 fn = ~ as.numeric(. > 0)) %>%
  # indicator: 1 if the value is the string "NULL", 0 otherwise
  step_mutate(agent = as.numeric(agent == "NULL"),
              company = as.numeric(company == "NULL")) %>%
  # keep the 5 most common countries and lump the rest together
  # (step_other() would be an alternative way to lump rare levels)
  step_mutate(country = fct_lump_n(country, n = 5)) %>%
  # center and scale the numeric predictors only, before making dummies
  step_normalize(all_predictors(),
                 -all_nominal()) %>%
  step_dummy(all_nominal(), -all_outcomes())
hotels_recipe %>%
prep(hotels_training) %>%
juice()
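To check the indicator logic in isolation, here is a minimal sketch on a made-up data frame. The names `toy`, `toy_rec`, `toy_prepped`, `toy_baked`, and `new_rows` are hypothetical and not part of the assignment. It assumes a recipes version that supports bake(new_data = NULL), which returns the processed training data the same way juice() does.

```r
library(recipes)
library(dplyr)

# Toy data, invented for illustration only
toy <- data.frame(
  y        = factor(c("0", "1", "0", "1")),
  children = c(0, 2, 0, 10),
  babies   = c(0, 0, 1, 0)
)

# Turn each count into a 0/1 indicator: 1 if more than 0, else 0
toy_rec <- recipe(y ~ ., data = toy) %>%
  step_mutate_at(children, babies, fn = ~ as.numeric(. > 0))

toy_prepped <- prep(toy_rec, training = toy)
toy_baked   <- bake(toy_prepped, new_data = NULL)

# New rows with counts larger than anything in the training data
new_rows <- data.frame(
  y        = factor("0", levels = c("0", "1")),
  children = 50,
  babies   = 3
)
new_baked <- bake(toy_prepped, new_data = new_rows)
```

Because the rule is "greater than 0" rather than a lookup of the values seen in training, counts that never appeared before still map to 1.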